TIIS (Korean Society for Internet Information)


Title: A Machine-Learning Based Approach for Extracting Logical Structure of a Styled Document
Authors: Tae-young Kim, Suntae Kim, Sangchul Choi, Jeong-Ah Kim, Jae-Young Choi, Jong-Won Ko, Jee-Huong Lee, Youngwha Cho
Citation: Vol. 11, No. 2, pp. 1043-1056 (February 2017)
Abstract:
A styled document is a document that uses diverse decorating functions, such as different fonts, colors, tables and images, and is generally authored in a word processor (e.g., MS-WORD, Open Office). Compared to a plain-text document, a styled document enables a human to easily recognize its logical structure, such as sections, subsections and contents. However, it is difficult for a computer to recognize this structure if the writer does not explicitly specify the type of each element by using the styling functions of the word processor. This is one of the obstacles to enhancing document version management systems, because they currently manage a document with the file as the unit of management rather than the individual document elements. This paper proposes a machine-learning based approach to analyzing the logical structure of a styled document composed of sections, subsections and contents. We first suggest a feature vector for characterizing the elements of a styled document; it is composed of eight features, such as font size, indentation and period, each of which is frequently found in styled documents. Then, we train machine learning classifiers such as Random Forest and Support Vector Machine using the suggested feature vector. The trained classifiers are used to automatically identify the logical structure of a styled document. Our experiment obtained a precision of 92.78% and a recall of 94.02% in analyzing the logical structure of 50 styled documents.
Keywords: Logical Structure Analysis, Machine Learning, Feature Vector, Document Management System
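
As a rough illustration of the idea in the abstract, the sketch below encodes a document element as a small numeric feature vector and computes precision and recall for one class. It is not the authors' implementation. It covers only the three features the abstract names (font size, indentation, period); the other five are not listed there and are left out. The `Element` type, its field names, the reading of "period" as "the text ends with a period", and the class labels are all assumptions made for this example. The classifier itself (e.g., Random Forest or Support Vector Machine) is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """One paragraph-level element of a styled document (hypothetical type)."""
    text: str
    font_size: float    # in points (assumed unit)
    indentation: float  # left indent in points (assumed unit)


def feature_vector(element: Element) -> List[float]:
    """Encode the three features named in the abstract as numbers.

    Assumption: the "period" feature is read as "text ends with a period".
    """
    ends_with_period = 1.0 if element.text.rstrip().endswith(".") else 0.0
    return [element.font_size, element.indentation, ends_with_period]


def precision_recall(predicted: List[str], actual: List[str],
                     label: str) -> Tuple[float, float]:
    """Return (precision, recall) for a single class label."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == label and a == label)
    fp = sum(1 for p, a in pairs if p == label and a != label)
    fn = sum(1 for p, a in pairs if p != label and a == label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical elements: a heading-like line and a body-like line.
heading = Element("1. Introduction", font_size=16.0, indentation=0.0)
body = Element("A styled document contains fonts and tables.",
               font_size=11.0, indentation=18.0)
print(feature_vector(heading))
print(feature_vector(body))
```

In a fuller pipeline, vectors like these would be paired with hand-labeled element types and passed to a classifier, and `precision_recall` would then be applied to its predictions on held-out documents.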